As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks involving human-computer interaction, especially in application-oriented fields such as music information retrieval. In this work, we explore cross-modal learning in an attempt to bridge audio and language in the music domain. To this end, we propose MusCALL, a framework for music contrastive audio-language learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences, producing multimodal embeddings that can be used for text-to-audio and audio-to-text retrieval. Thanks to this property, MusCALL can be transferred to virtually any task that can be cast as text-based retrieval. Our experiments show that our method performs significantly better than the baselines at retrieving audio that matches a textual description and, conversely, text that matches an audio query. We also demonstrate that the multimodal alignment capability of our model can be successfully extended to the zero-shot transfer scenario for genre classification and auto-tagging on two public datasets.
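To make the contrastive objective concrete, here is a minimal sketch of a CLIP-style dual-encoder loss over audio-text pairs in PyTorch; the temperature value and function names are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_audio_text_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) outputs of the audio and text encoders."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)  # matching pairs on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)              # audio-to-text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text-to-audio direction
    return 0.5 * (loss_a2t + loss_t2a)
```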
Over the past 15 years, the segmentation of vessels in retinal images has become an intensively researched problem in medical imaging, with hundreds of algorithms published. One of the de facto benchmark datasets for vessel segmentation techniques is the DRIVE dataset. Since DRIVE contains a predefined split of training and test images, the published performance results of the various segmentation techniques should provide a reliable ranking of the algorithms. Including more than 100 papers in this study, we performed a detailed numerical analysis of the consistency of the published performance scores. We found inconsistencies in the reported scores related to the use of the field of view (FOV), which has a significant impact on the performance scores. We attempted to eliminate these biases using numerical techniques in order to provide the most realistic picture of the state of the art. Based on the results, we have formulated several findings, most notably: despite the well-defined test set, most rankings in the published papers are based on non-comparable figures; in contrast to the near-perfect accuracy scores reported in the literature, the highest accuracy score achieved to date is 0.9582 in the FOV region, about 1% higher than that of the human annotator. The methodology we developed for identifying and eliminating evaluation biases can be readily applied to other domains where similar problems may arise.
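The effect of the FOV convention can be illustrated with a small accuracy computation using made-up pixel counts: pixels outside the circular FOV are trivially background, so counting them as correct predictions inflates the score without any change to the segmentation itself.

```python
# Illustrative sketch with hypothetical confusion counts for one DRIVE test image.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

tp, tn, fp, fn = 20_000, 190_000, 5_000, 6_000   # counts inside the FOV (made up)
outside_fov = 100_000                            # background pixels outside the FOV, "correct" by default

acc_fov = accuracy(tp, tn, fp, fn)               # FOV-only evaluation
acc_full = accuracy(tp, tn + outside_fov, fp, fn)  # whole-image evaluation
print(f"accuracy inside FOV:  {acc_fov:.4f}")    # ~0.950
print(f"accuracy whole image: {acc_full:.4f}")   # ~0.966, higher for the same segmentation
```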
In this paper, we present the dataset used for the data challenge organized by the Conference on Sound and Music Technology (CSMT). The CSMT data challenge requires participants to identify whether a given melody was generated by a computer or composed by a human. The dataset consists of two parts: a development dataset and an evaluation dataset. The development dataset contains only computer-generated melodies, whereas the evaluation dataset contains both computer-generated and human-composed melodies. The purpose of the dataset is to examine whether computer-generated melodies can be distinguished by learning the characteristics of generated melodies.
We study the ability of foundation models to learn representations for classification that are transferable to new, unseen classes. Recent results in the literature show that representations learned by a single classifier over many classes are competitive on few-shot learning problems with representations learned by special-purpose algorithms designed for such problems. We offer an explanation for this phenomenon based on the concept of class-features variability collapse, which refers to the training dynamics of deep classification networks where the feature embeddings of samples belonging to the same class tend to concentrate around their class means. More specifically, we examine the few-shot error of the learned feature map, which is the classification error of the nearest class-center classifier using centers learned from a small number of random samples from each class. Assuming that the classes appearing in the data are selected independently from a distribution, we show that the few-shot error generalizes from the training data to unseen test data, and we provide an upper bound on the expected few-shot error for new classes (selected from the same distribution) using the average few-shot error for the source classes. Additionally, we show that the few-shot error on the training data can be upper bounded using the degree of class-features variability collapse. This suggests that foundation models can provide feature maps that are transferable to new downstream tasks even with limited data available.
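As a concrete reference point, the nearest class-center few-shot classifier analyzed here can be sketched as follows; array shapes and function names are illustrative and not tied to any particular foundation model.

```python
import numpy as np

def ncc_predict(support_emb, support_labels, query_emb):
    """support_emb: (n, d) feature-map outputs of the support samples; query_emb: (m, d)."""
    classes = np.unique(support_labels)
    centers = np.stack([support_emb[support_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(query_emb[:, None, :] - centers[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]          # assign each query to the nearest class center

def few_shot_error(support_emb, support_labels, query_emb, query_labels):
    """Classification error of the nearest class-center rule on held-out queries."""
    preds = ncc_predict(support_emb, support_labels, query_emb)
    return float(np.mean(preds != query_labels))
```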
We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful designs of the optimization dynamics are critical to learning meaningful representations. We identify that a faster paced optimization of the predictor and semi-gradient updates on the representation, are crucial to preventing the representation collapse. Then in an idealized setup, we show self-predictive learning dynamics carries out spectral decomposition on the state transition matrix, effectively capturing information of the transition dynamics. Building on the theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.
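The two design choices highlighted above, a faster-paced optimization of the predictor and a semi-gradient (stop-gradient) update on the representation, can be sketched as a training step like the one below; network definitions, inner-step counts, and learning rates are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_predictive_step(encoder, predictor, enc_opt, pred_opt, obs, next_obs):
    # Target representation: no gradient flows into the encoder through the target.
    with torch.no_grad():
        target = encoder(next_obs)

    # Faster-paced predictor optimization: several inner steps on the predictor alone.
    for _ in range(5):
        pred_opt.zero_grad()
        pred_loss = F.mse_loss(predictor(encoder(obs).detach()), target)
        pred_loss.backward()
        pred_opt.step()

    # Semi-gradient update of the representation against the fixed target.
    enc_opt.zero_grad()
    loss = F.mse_loss(predictor(encoder(obs)), target)
    loss.backward()
    enc_opt.step()
    return loss.item()
```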
Property inference attacks against machine learning (ML) models aim to infer properties of the training data that are unrelated to the primary task of the model, and have so far been formulated as binary decision problems, i.e., whether or not the training data have a certain property. However, in industrial and healthcare applications, the proportion of labels in the training data is quite often also considered sensitive information. In this paper we introduce a new type of property inference attack that, unlike the binary decision problems in the literature, aims at inferring the class label distribution of the training data from the parameters of ML classifier models. We propose a method based on \emph{shadow training} and a \emph{meta-classifier} trained on the parameters of the shadow classifiers augmented with the accuracy of the classifiers on auxiliary data. We evaluate the proposed approach for ML classifiers with fully connected neural network architectures. We find that the proposed \emph{meta-classifier} attack provides a maximum relative improvement of $52\%$ over the state of the art.
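A condensed sketch of the shadow-training pipeline is given below; the shadow models are assumed to be scikit-learn MLP classifiers sharing one architecture, and the feature construction (flattened parameters plus auxiliary-data accuracy) follows the description above rather than the paper's exact code.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def shadow_features(shadow_model, X_aux, y_aux):
    """Flatten a shadow classifier's weights and append its accuracy on auxiliary data."""
    params = np.concatenate([w.ravel() for w in shadow_model.coefs_]
                            + [b.ravel() for b in shadow_model.intercepts_])
    acc = shadow_model.score(X_aux, y_aux)
    return np.append(params, acc)

def train_meta_classifier(shadow_models, shadow_label_dists, X_aux, y_aux):
    """shadow_label_dists: per-shadow-model class-label proportions used as regression targets."""
    feats = np.stack([shadow_features(m, X_aux, y_aux) for m in shadow_models])
    meta = MLPRegressor(hidden_layer_sizes=(128,), max_iter=2000)
    meta.fit(feats, np.stack(shadow_label_dists))
    return meta
```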
Optical coherence tomography (OCT) is a non-invasive 3D modality widely used in ophthalmology to image the retina. Automated, anatomically coherent retinal layer segmentation on OCT is important for detecting and monitoring different retinal diseases, such as age-related macular degeneration (AMD) or diabetic retinopathy. However, most state-of-the-art layer segmentation methods are based on purely supervised deep learning, which requires large amounts of pixel-level annotated data that are expensive and difficult to obtain. With this in mind, we introduce a semi-supervised paradigm to the retinal layer segmentation task, which exploits the information present in large-scale unlabeled datasets as well as anatomical priors. In particular, a novel fully differentiable approach is used to convert surface position regression into a pixel-wise structured segmentation, allowing the model to be trained with coupled 1D surface and 2D layer representations simultaneously. These 2D segmentations are used as anatomical factors that, together with learned style factors, compose a disentangled representation used to reconstruct the input image. At the same time, we propose a set of anatomical priors that improve network training when limited labeled data are available. On a real-world dataset of scans with intermediate and wet AMD, we demonstrate that our method outperforms the state of the art, both when using our full training set and when only a fraction of the labeled data is used.
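One way such a differentiable surface-to-layer conversion can look, purely as an illustration and not the paper's exact formulation, is to turn each per-column surface position into a soft "below the surface" mask with a sigmoid over row indices and take differences between adjacent surfaces:

```python
import torch

def soft_surface_to_mask(surface_pos, height, sharpness=5.0):
    """surface_pos: (width,) sub-pixel row position of one boundary per column (A-scan)."""
    rows = torch.arange(height, dtype=torch.float32).unsqueeze(1)        # (height, 1)
    return torch.sigmoid(sharpness * (rows - surface_pos.unsqueeze(0)))  # ~1 below the surface

def layer_mask(upper_surface, lower_surface, height):
    """Soft 2D mask of the layer enclosed between two boundary surfaces."""
    return soft_surface_to_mask(lower_surface, height) - soft_surface_to_mask(upper_surface, height)
```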
In this paper, a novel solution is introduced for visual simultaneous localization and mapping (vSLAM) built from deep learning components. The proposed architecture is a highly modular framework in which each component provides state-of-the-art results in its respective field of vision-based deep learning. The paper shows that through the synergistic integration of these individual building blocks, a functioning and efficient all-through deep neural (ATDN) vSLAM system can be created. An embedding distance loss function is introduced and used to train the ATDN architecture. The resulting system achieves 4.4% translation error and 0.0176 deg/m rotation error on a subset of the KITTI dataset. The proposed architecture can be used for efficient, low-latency autonomous driving (AD) assistance database creation as well as a basis for autonomous vehicle (AV) control.
We study distributed contextual linear bandits with stochastic contexts, where $N$ agents act cooperatively to solve a linear bandit-optimization problem with $d$-dimensional features over the course of $T$ rounds. For this problem, we derive the first ever information-theoretic lower bound $\Omega(dN)$ on the communication cost of any algorithm that performs optimally in a regret minimization setup. We then propose a distributed batch elimination version of the LinUCB algorithm, DisBE-LUCB, where the agents share information among each other through a central server. We prove that the communication cost of DisBE-LUCB matches our lower bound up to logarithmic factors. In particular, for scenarios with known context distribution, the communication cost of DisBE-LUCB is only $\tilde{\mathcal{O}}(dN)$ and its regret is ${\tilde{\mathcal{O}}}(\sqrt{dNT})$, which is of the same order as that incurred by an optimal single-agent algorithm for $NT$ rounds. We also provide similar bounds for practical settings where the context distribution can only be estimated. Therefore, our proposed algorithm is nearly minimax optimal in terms of \emph{both regret and communication cost}. Finally, we propose DecBE-LUCB, a fully decentralized version of DisBE-LUCB, which operates without a central server, where agents share information with their \emph{immediate neighbors} through a carefully designed consensus procedure.
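For context, the single-agent LinUCB building block that DisBE-LUCB batches and distributes can be sketched as below; this is the textbook ridge-regression/UCB update, not the paper's distributed protocol.

```python
import numpy as np

class LinUCB:
    def __init__(self, d, alpha=1.0, reg=1.0):
        self.A = reg * np.eye(d)   # regularized Gram matrix of observed features
        self.b = np.zeros(d)       # feature-weighted sum of rewards
        self.alpha = alpha         # exploration parameter

    def select(self, contexts):
        """contexts: (n_actions, d) feature vectors of the available actions."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                                   # ridge-regression estimate
        bonus = np.sqrt(np.einsum("ad,dk,ak->a", contexts, A_inv, contexts))
        return int(np.argmax(contexts @ theta + self.alpha * bonus))

    def update(self, x, reward):
        self.A += np.outer(x, x)
        self.b += reward * x
```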
Joint time-frequency scattering (JTFS) is a convolutional operator in the time-frequency domain that extracts spectrotemporal modulations at various rates and scales. It provides an idealized model of the spectrotemporal receptive fields (STRF) in the primary auditory cortex, and can thus serve as a biologically plausible surrogate for human perceptual judgments at the scale of isolated audio events. Yet prior implementations of JTFS and STRF have remained outside the standard toolkit of perceptual similarity measures and evaluation methods for audio generation. We trace this issue to three limitations: differentiability, speed, and flexibility. In this paper, we present an implementation of time-frequency scattering in Python. Unlike prior implementations, ours accommodates NumPy, PyTorch, and TensorFlow as backends and is therefore portable to both CPU and GPU. We illustrate the usefulness of JTFS with three applications: unsupervised manifold learning of spectrotemporal modulations, supervised classification of musical instruments, and texture resynthesis of bioacoustic sounds.
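As a rough, library-agnostic illustration of what the JTFS operator computes (not the API of the implementation described above), one can filter a log-spectrogram along both the time and frequency axes with Gabor-like kernels at several rates and scales; every filter choice below is an assumption made for illustration only.

```python
import numpy as np
from scipy.signal import spectrogram, fftconvolve

def joint_modulation_responses(x, fs, rates=(2.0, 4.0, 8.0), scales=(0.5, 1.0)):
    """x: mono audio signal; fs: sample rate in Hz. Returns one 2D response per (rate, scale) pair."""
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)   # (freq_bins, frames)
    logS = np.log1p(S)                                           # stand-in for a wavelet scalogram
    frame_rate = fs / 64.0                                       # frames per second (hop = 64 samples)
    t = np.arange(-32, 33) / frame_rate                          # temporal filter support, in seconds
    f = np.arange(-16, 17)                                       # frequential filter support, in bins
    responses = {}
    for rate in rates:                                           # temporal modulation rate, in Hz
        for scale in scales:                                     # frequential modulation, cycles per 16 bins
            g_t = np.exp(2j * np.pi * rate * t) * np.exp(-(t / 0.05) ** 2)
            g_f = np.exp(2j * np.pi * scale * f / 16.0) * np.exp(-(f / 8.0) ** 2)
            kernel = np.outer(g_f, g_t)                          # separable 2D Gabor-like filter
            responses[(rate, scale)] = np.abs(fftconvolve(logS, kernel, mode="same"))
    return responses
```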